Vers une modélisation statistique multi-niveau du langage, application aux langues peu dotées. (Toward a multi-level statistical language modeling for under-resourced language)

Author

  • Sopheap Seng
Abstract

This PhD thesis addresses the problems encountered when developing automatic speech recognition for under-resourced languages whose writing systems do not explicitly separate words. For these languages, the text corpus must be automatically segmented into words before n-gram language modeling can be applied. The lack of text data already degrades language model performance, and the errors introduced by automatic segmentation can make the available data even less usable. To deal with these problems, our research focuses primarily on language modeling, and in particular on the choice of lexical and sub-lexical units used by the recognition systems. We investigate the use of multiple units in a speech recognition system. At the language model level, the models are trained with hybrid vocabularies that combine lexical and sub-lexical units. At the system output level, we attempt to combine the outputs of several recognition systems, each based on a different modeling unit: lexical or sub-lexical. To better exploit the textual data by taking different views of the same data, we propose a method that performs multiple segmentations of the training corpus instead of a conventional single segmentation. This method, based on finite-state machines, generates all possible segmentations of a character sequence, from which n-grams are then extracted to train the language model. It recovers n-grams that a single segmentation misses and adds them to the language model. We validate these multiple-unit modeling approaches in recognition systems for a group of languages: Khmer, Vietnamese, Thai and Laotian.
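The multiple-segmentation idea can be illustrated with a minimal sketch. It is not the thesis's finite-state implementation: it assumes a toy lexicon, enumerates segmentations by plain recursion, and counts every n-gram once per segmentation in which it appears. The function names `all_segmentations` and `ngram_counts`, the lexicon, and the counting scheme are all hypothetical choices made for illustration.

```python
from collections import Counter
from functools import lru_cache


def all_segmentations(text, vocab, max_len=None):
    """Return every way to split `text` into words that all belong to `vocab`.

    Assumption: a word is valid if and only if it appears in `vocab`.
    """
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # Base case: the whole string has been consumed.
        if start == len(text):
            return [()]
        results = []
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append((word,) + rest)
        return results

    return [list(seg) for seg in split_from(0)]


def ngram_counts(segmentations, n=2):
    """Count n-grams over all segmentations (one count per occurrence)."""
    counts = Counter()
    for seg in segmentations:
        for i in range(len(seg) - n + 1):
            counts[tuple(seg[i:i + n])] += 1
    return counts


# Toy lexicon and input, invented for this example.
toy_vocab = {"a", "b", "ab", "c", "bc"}
segs = all_segmentations("abc", toy_vocab)
bigrams = ngram_counts(segs, n=2)
```

With this toy lexicon, the string `abc` has three segmentations, so the bigram table contains entries such as `("a", "bc")` that a single segmentation like `["ab", "c"]` would never produce. The number of segmentations can grow exponentially with the length of the input, which is one reason a compact lattice representation such as a finite-state machine is attractive in practice.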


Similar resources

A State of the Art of Word Sense Induction: A Way Towards Word Sense Disambiguation for Under-Resourced Languages

Word Sense Disambiguation (WSD), the process of automatically identifying the meaning of a polysemous word in a sentence, is a fundamental task in Natural Language Processing (NLP). Progress in this approach to WSD opens up many promising developments in the field of NLP and its applications. Indeed, ...


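G-OWL : Vers un langage de modélisation graphique, polymorphique et typé pour la construction d'une ontologie dans la notation OWL (G-OWL: toward a graphical, polymorphic and typed modeling language for building an ontology in OWL notation)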

Abstract: The Web Ontology Language (OWL), standardized by the W3C, aims to provide an ontology design language for the Semantic Web. Engineering an ontology is a complex activity that requires a skill not readily accessible to content experts. By contrast, to model business content, semi-formal graphical modeling is a technique often used for...


Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]

This paper investigates the impact on ASR performance of sub-word units for two under-resourced African languages with rich morphology (Amharic and Swahili). Two sub-word units are considered: syllable and morpheme, the latter being obtained in a supervised or...


Transformation des contraintes d'intégrité - Des modèles conceptuels vers le relationnel (Transformation of integrity constraints: from conceptual models to the relational model)

ABSTRACT. In a conceptual model, integrity constraints are an integral part whose definition is necessary to best express the semantics of the perceived reality. However, even when these constraints are expressed at the conceptual level, they are very often ignored during the transition to the logical level. In practice, most CASE modeling tools do not support...



Publication date: 2010